Abstract

The difficulty of developing reliable parallel soft- 
ware is generating interest in deterministic environ- 
ments, where a given program and input can yield only 
one possible result. Languages or type systems can en- 
force determinism in new code, and runtime systems can 
impose synthetic schedules on legacy parallel code. To 
parallelize existing serial code, however, we would like a 
programming model that is naturally deterministic with- 
out language restrictions or artificial scheduling. We 
propose deterministic consistency (DC), a parallel programming
model as easy to understand as the "parallel assignment"
construct in sequential languages such as Perl and
JavaScript, where concurrent threads always read their
inputs before writing shared outputs. DC supports common
data- and task-parallel synchronization abstractions
such as fork/join and barriers, as well as non-hierarchical 
structures such as producer/consumer pipelines and fu- 
tures. A preliminary prototype suggests that software- 
only implementations of DC can run applications writ- 
ten for popular parallel environments such as OpenMP 
with low (< 10%) overhead for some applications. 

1 Introduction 

For decades, the "gold standard" in multiprocessor
programming models has been sequentially consistent
shared memory [25] with mutual exclusion [20]. Alternative
models, such as explicit message passing [29] or
weaker consistency [17], usually represent compromises
to improve performance without giving up "too much"
of the simplicity and convenience of sequentially consistent
shared memory. But are sequential consistency
and mutual exclusion really either simple or convenient?

In this model, we find that slight concurrency errors
yield subtle heisenbugs [27, 28] and security vulnerabilities
[34]. Data race detection [16, 30] or transactional
memory [19, 32] can help ensure mutual exclusion, but
even "race-free" programs may have heisenbugs [?].
Heisenbugs result from nondeterminism in general, a
realization that has inspired new languages that ensure
determinism through communication constraints [33] or
type systems [7]. But to parallelize the vast body of
sequential code for new multicore systems, we would like
a programming model that is simple, convenient,
deterministic, and compatible with existing languages.

Figure 1: Deterministic versus sequential consistency

To this end, we propose a new memory model called 
deterministic consistency or DC. In DC, concurrent 
threads logically share an address space but never see
each other's writes, except when they synchronize
explicitly and deterministically. To illustrate DC, consider
the "parallel assignment" operator in many sequential 
languages such as Python, Perl, Ruby, and JavaScript, 
with which one may swap two variables as follows:

x, y = y, x

This construct implies no actual parallel execution:
the statement merely evaluates all right-side expressions 
(in some order) before writing their results to the left- 
side variables. Now consider a "truly parallel" analog, 
using Hoare's notation for fork/join parallelism [20]:

{x := y} // {y := x} 

This statement forks two threads, each of which reads 
one variable and then writes the other; the threads then 
synchronize and rejoin. As Figure 1 illustrates, under
sequential consistency, this parallel statement may swap
the variables or overwrite one with the other, depending
on timing. Making each thread's actions atomic, by
enclosing the assignments in critical sections or
transactions, eliminates the swapping case but leaves a
nondeterministic choice between x overwriting y and y
overwriting x. How popular would the former "parallel
assignment" construct be if it behaved in this way?
Deterministic consistency, in contrast, reliably behaves like a
parallel assignment: each thread reads all inputs before
writing any shared results.
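
The contrast can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the prototype's implementation: the helper `dc_fork_join` and the dictionary standing in for shared memory are hypothetical, and writes are detected by comparing values against the fork-time snapshot, so a write that stores an unchanged value goes unnoticed. Each thread body runs against a private copy of the shared state, and the parent merges the children's writes only at the join.

```python
import threading


def dc_fork_join(shared, bodies):
    """Run each body on a private snapshot of `shared`; merge writes at join.

    Hypothetical helper for illustration only.
    """
    snapshot = dict(shared)                    # state released by the parent at fork
    workspaces = [dict(snapshot) for _ in bodies]
    threads = [threading.Thread(target=body, args=(ws,))
               for body, ws in zip(bodies, workspaces)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Join: acquire each child's writes (its diff against the snapshot),
    # always in thread-index order, so timing cannot affect the result.
    for ws in workspaces:
        for key, value in ws.items():
            if key not in snapshot or snapshot[key] != value:
                shared[key] = value
    return shared


def copy_y_to_x(mem):   # {x := y}
    mem["x"] = mem["y"]


def copy_x_to_y(mem):   # {y := x}
    mem["y"] = mem["x"]


state = dc_fork_join({"x": 1, "y": 2}, [copy_y_to_x, copy_x_to_y])
print(state)
```

Both threads read the values released at the fork (x = 1, y = 2), so the outcome is always the swap, whatever the thread timing. Conflicting writes to the same location are not handled in this sketch.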

Like release consistency [17], DC distinguishes ordinary
reads and writes from synchronization operations
and classifies the latter into acquires and releases, which
determine at what point one thread sees (acquires) results
produced (released) by another thread. DC ensures
determinism by requiring that (1) program logic
uniquely pairs each acquire with a matching release,
(2) only an intervening acquire/release pair makes one
thread's writes visible to another thread, and (3) acquires
handle conflicting writes deterministically. Unlike most
memory models, reads never conflict with writes in DC:
the swapping example above contains no data race. A
natural way to understand DC, and one way to implement
it, is as a distributed shared memory [1, 24] in
which a release explicitly "transmits" a message containing
memory updates, and the matching acquire operation
"receives" and integrates these updates locally.

DC supports not only block-structured synchro- 
nization abstractions such as the fork/join, barrier, 
and task constructs of OpenMP [6], but also
non-hierarchical synchronization patterns such as dynamic
producer/consumer graphs and inter-thread queues. 
DC can emulate nondeterministic synchronization con- 
structs in existing parallel code via techniques such as 
deterministic scheduling [3, 4, 12], but for new or newly
parallelized code, we develop deterministic alternatives 
for common idioms such as pipelines and futures. A pro- 
totype in progress promises to be flexible and efficient 
enough for a variety of parallel applications. 

Section 2 defines DC at a low level, and Section 3 explores
its use in high-level environments like OpenMP.
Section 4 outlines implementation issues, Section 5
discusses related work, and Section 6 concludes.

2 Deterministic Consistency 

Since others have eloquently made the case for deter- 
ministic parallelism [7, 27], we will take its desirability 
for granted and focus on deterministic consistency (DC). 
This section defines the basic DC model and its low- 
level synchronization primitives, leaving the model's 
mapping to high-level abstractions to the next section. 

2.1 Defining Deterministic Consistency

As in release consistency (RC) [17, 24], DC separates
normal data accesses from synchronization operations
and classifies the latter into release, where a
thread makes recent state changes available for use by
other threads, and acquire, where a thread obtains state
changes made by other threads. A thread performs a
release when forking a child thread or entering a barrier,
for example, and an acquire when joining with a child or
leaving a barrier. As in RC, synchronization operations
in DC are sequentially consistent relative to each other,
and these synchronization operations determine when a
normal write in one thread must become visible to a normal
read in another thread: namely, when an intervening
chain of acquire/release pairs connects the two accesses
in a "happens-before" synchronization relation.

Figure 2: Example synchronization trace for three
threads with labeled and matched release/acquire pairs

While RC relaxes the constraints of sequential consistency
[25], allowing an even wider range of nondeterministic
orderings, DC in turn tightens RC's constraints
to permit only one unique execution behavior for a given 
parallel program. DC ensures determinism by adding 
three new constraints to those of RC: 

1. Program logic must uniquely pair release and ac- 
quire operations, so that each release "transmits" 
updates to a specific acquire in another thread. 

2. One thread's writes never become visible to another 
thread's reads until mandated by synchronization: 
i.e., writes propagate "as slowly as possible." 

3. If two threads perform conflicting writes to the 
same location, the implementation handles the con- 
flict deterministically at the relevant acquire. 

Constraint 1 makes synchronization deterministic by 
ensuring that a release in one thread always interacts 
with the same acquire in some other thread, at the same 
point in each thread's execution, regardless of execu- 
tion speeds. A program might in theory satisfy this 
constraint by specifying each synchronization opera- 
tion's "partner" explicitly through a labeling scheme. If 
each thread has a unique identifier T, and we assign
each of T's synchronization actions a consecutive integer
N, then a (T, N) pair uniquely names any synchronization
event in a program's execution. The program
then invokes synchronization primitives of the form
acquire(T_r, N_r) and release(T_a, N_a), where
(T_r, N_r) names the acquire's partner release and
(T_a, N_a) names the release's partner acquire. Figure 2
illustrates a 3-thread execution trace
with matched and labeled acquire/release pairs. We suggest
this scheme only to clarify DC: explicit labeling
would be an unwelcome practical burden, and Section 3
discusses more convenient high-level abstractions.

Constraint 2 makes normal accesses deterministic by 
ensuring that writes in a given thread become visible to 
reads in another thread at only one possible moment.
Release consistency already requires a write by thread T1
to become visible to thread T2 no later than the moment
T2 performs an acquire directly or indirectly following
T1's next release after the write. RC permits the write
to become visible to T2 before this point, but DC requires
the write to propagate to T2 at exactly this point.
By delaying writes "as long as possible," DC ensures 
that non-conflicting normal accesses behave determinis- 
tically while preserving the key property that makes RC 
efficient: it keeps parallel execution as independent as 
possible subject to synchronization constraints. 

DC's third constraint affects only programs with data
races. If both threads in Figure 1 wrote to the same variable
before rejoining, for example, DC requires the join
to handle this race deterministically. Since data races 
usually indicate software bugs, one response is to throw 
a runtime exception. Other behaviors, e.g., prioritizing 
one write over the other, would not affect correct pro- 
grams but may be less helpful with buggy code. 
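
One way to picture the exception-throwing response is the following Python sketch. The names `DataRaceError` and `join_diffs` are hypothetical and introduced only for illustration; each thread's writes since the fork are assumed to be available as a dictionary from location to value.

```python
class DataRaceError(Exception):
    """Raised when two threads wrote the same location before a join."""


def join_diffs(base, diffs):
    """Apply per-thread write sets (`diffs`, in thread order) to `base`.

    Hypothetical helper: a location written by more than one thread is
    reported as a race, regardless of which thread finished first.
    """
    merged = dict(base)
    writer = {}                                  # location -> thread index
    for tid, diff in enumerate(diffs):
        for loc, value in diff.items():
            if loc in writer:
                raise DataRaceError(
                    f"threads {writer[loc]} and {tid} both wrote {loc!r}")
            writer[loc] = tid
            merged[loc] = value
    return merged


print(join_diffs({"x": 1, "y": 2}, [{"x": 2}, {"y": 1}]))   # disjoint writes
try:
    join_diffs({"x": 1}, [{"x": 5}, {"x": 7}])              # both write x
except DataRaceError as err:
    print("race:", err)
```

Because the check depends only on which locations each thread wrote, and not on when the threads ran, a racy program fails the same way on every execution. A prioritizing policy would replace the `raise` with a fixed rule such as "the higher thread index wins".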

2.2 Why DC is Deterministic 

To clarify why the above rules adequately ensure deterministic
execution in spite of arbitrary parallelism, we
briefly sketch a proof of DC's determinism.

Theorem: A parallel program whose sequential fragments
execute deterministically, and whose memory access
and synchronization behavior conforms to the rules
in Section 2.1, yields at most one possible result.

Proof Sketch: Assume each synchronization operation
explicitly names its "partner" as described above.
Suppose we implement DC by accumulating memory
"diffs" and passing them at synchronization points atop
a message-passing substrate, as in distributed shared
memory [1, 24]. Assume the substrate provides an unlimited
number of buffered message channels, each with
a unique name of the form (T_r, N_r, T_a, N_a). When a
thread T_r invokes a release(T_a, N_a) operation labeled
(T_r, N_r), T_r sends all diffs it has accumulated
so far on channel (T_r, N_r, T_a, N_a). Similarly, when
thread T_a invokes an acquire(T_r, N_r) operation labeled
(T_a, N_a), it receives a set of diffs on channel
(T_r, N_r, T_a, N_a) and applies those it does not already
have. Since each channel (T_r, N_r, T_a, N_a) is used by
only one sender T_r and one receiver T_a, the resulting
system forms a Kahn process network [23], and DC's
determinism follows from that of Kahn networks.
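
The construction in the proof sketch can be modeled in a few lines of Python. This is a sketch under assumptions, not the authors' implementation: the class `DCThread` and the helper `channel` are hypothetical names, the standard `queue.Queue` stands in for a buffered channel, and conflicting writes are simply overwritten rather than detected.

```python
import queue
import threading

_channels = {}
_channels_lock = threading.Lock()


def channel(tr, nr, ta, na):
    """Return the unique FIFO channel named (T_r, N_r, T_a, N_a)."""
    with _channels_lock:
        return _channels.setdefault((tr, nr, ta, na), queue.Queue())


class DCThread:
    def __init__(self, tid, memory):
        self.tid = tid
        self.mem = dict(memory)   # private workspace
        self.diff = {}            # writes accumulated (or received) so far
        self.n = 0                # index of this thread's next sync event

    def write(self, loc, value):
        self.mem[loc] = value
        self.diff[loc] = value

    def release(self, ta, na):
        """Send accumulated diffs to acquire number `na` of thread `ta`."""
        channel(self.tid, self.n, ta, na).put(dict(self.diff))
        self.n += 1

    def acquire(self, tr, nr):
        """Block for release number `nr` of thread `tr`, then apply it."""
        updates = channel(tr, nr, self.tid, self.n).get()
        self.mem.update(updates)
        self.diff.update(updates)   # pass received writes on transitively
        self.n += 1


def producer(t):
    t.write("a", 41)
    t.release(ta=1, na=0)           # event (0, 0), partner (1, 0)


def consumer(t, out):
    t.acquire(tr=0, nr=0)           # event (1, 0), partner (0, 0)
    t.write("b", t.mem["a"] + 1)
    out.update(t.mem)


result = {}
initial = {"a": 0, "b": 0}
t0, t1 = DCThread(0, initial), DCThread(1, initial)
workers = [threading.Thread(target=consumer, args=(t1, result)),
           threading.Thread(target=producer, args=(t0,))]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(result)
```

The consumer is started first on purpose: since its acquire blocks on a channel with exactly one sender, it observes the producer's write no matter which thread the scheduler runs first.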

3 High-level Synchronization 

We are developing DOMP, a variant of OpenMP [6] with
deterministic consistency. DOMP retains OpenMP's 
language neutrality and convenience, supporting most 
OpenMP constructs except for fundamentally nondeter- 
ministic ones, and extending OpenMP to support general 
reductions and non-hierarchical dependency structures. 

Fork/Join: OpenMP's foundation is its parallel 
construct, which forks multiple threads to execute a parallel
code block and then rejoins them. Fork/join parallelism
maps readily to DC, as shown in Figure 3(a): on
fork, the parent releases to an acquire at the birth of each 
child; on join, the parent acquires the final results each 
child releases at its death. OpenMP's work-sharing con- 
structs, such as parallel for loops, merely affect each 
child thread's actions within this fork/join model. 

Barrier: At a barrier, each thread releases to each
other thread, then acquires from each other thread, as
in Figure 3(b). Although we view an n-thread barrier
as n - 1 releases and acquires per thread, DOMP avoids
this n^2 cost using "broadcast" release/acquire primitives,
which are consistent with DC as long as each release
matches a well-defined set of acquires and vice versa.
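
A "broadcast" barrier of this kind might be sketched as follows in Python. This is a simplified model and not DOMP's implementation: the per-thread slots `diffs` and the function `worker` are hypothetical, and the standard `threading.Barrier` is used only to separate the release phase from the acquire phase.

```python
import threading

N = 4
barrier = threading.Barrier(N)
diffs = [None] * N            # one broadcast slot per thread
views = [None] * N            # each thread's memory after the barrier


def worker(tid):
    mem = {i: 0 for i in range(N)}        # private workspace
    mem[tid] = tid * tid                  # write, visible only locally so far
    diffs[tid] = {tid: mem[tid]}          # release: publish own writes once
    barrier.wait()                        # all releases precede all acquires
    for other in range(N):                # acquire: apply in thread order
        mem.update(diffs[other])
    views[tid] = mem


threads = [threading.Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(views[0])
```

Each thread publishes its writes once rather than sending them to n - 1 partners, yet every release still matches a well-defined set of acquires: all threads leave the barrier with the same view.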

Ordering: OpenMP's ordered construct orders a 
particular code block within a loop by iteration while 
permitting parallelism in other parts. DOMP implements
this construct using a chain of acquire/release
pairs among worker threads, as shown in Figure 3(c).

Reductions: OpenMP's reduction attributes and 
atomic constructs enable programs to accumulate 
sums, maxima, or bit masks efficiently across threads. 
OpenMP unfortunately supports reductions only on 
simple scalar types, leading programmers to serial- 
ize complex reductions unnecessarily via ordered or 
critical sections or locks. All uses of these serialization
constructs in the NAS Parallel Benchmarks [?]
implement reductions, for example. DOMP therefore provides
a generalized reduction construct, by which a
program can specify a custom reduction on pairs of vari- 
ables of any matching types, as in this example: 

#pragma omp reduction(a:a1, b:b1, c:c1)
{ a += a1; b = max(b, b1);
  if (c1.score > c.score) c = c1; }

Figure 3: Mapping of High-level Synchronization Operations to Acquire/Release Pairs

DOMP accumulates each thread's partial results in
thread-private variables and reduces them at the next join
or barrier via combining trees, improving both convenience
and scalability over serialized reduction.
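
The combining-tree idea can be illustrated with a short Python sketch. It makes several assumptions: `tree_reduce` and `combine` are hypothetical names, the per-thread partial results are given as a non-empty list of (a, b, c) tuples, and the tree is evaluated sequentially here although each level could run in parallel.

```python
def tree_reduce(partials, combine):
    """Pairwise-combine per-thread partial results in a fixed tree order."""
    level = list(partials)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # odd element carries up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


def combine(p, q):
    """Mirror of the reduction body: sum a, max b, keep best-scoring c."""
    a, b, c = p
    a1, b1, c1 = q
    return (a + a1, max(b, b1), c1 if c1["score"] > c["score"] else c)


partials = [(1, 5, {"id": 0, "score": 3}),
            (2, 9, {"id": 1, "score": 7}),
            (3, 4, {"id": 2, "score": 7}),
            (4, 8, {"id": 3, "score": 1})]
print(tree_reduce(partials, combine))
```

Because the tree shape depends only on the number of threads, ties (such as the two records with score 7) are always broken the same way, which a lock-protected update in arrival order would not guarantee.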

Tasks: OpenMP 3.0's task constructs express a form 
of fork/join parallelism suited to dynamic work struc- 
tures. Since DC rules prevent a task from seeing any 
writes of other tasks until it completes and synchronizes 
at a barrier or taskwait, DOMP eliminates OpenMP's 
risk of subtle bugs if one task uses shared inputs that are 
freed or go out of scope in a concurrent task. 

DOMP extends OpenMP with explicit task objects, 
with which a taskwait construct can name and syn- 
chronize with a particular task instance independently 
of other tasks, in order to express futures [18] or
non-hierarchical dependency graphs [15] deterministically:

omp_task mytask;
#pragma omp task(mytask)
{ ... task code ... }
... other tasks ...
#pragma omp taskwait(mytask)
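
For readers more familiar with futures, a rough analog of this named-task pattern can be written with Python's standard `concurrent.futures` module. This only illustrates synchronizing with one particular task instance; Python threads share memory without DC's isolation, and `task_code` is a placeholder for the task body.

```python
from concurrent.futures import ThreadPoolExecutor


def task_code(n):
    """Placeholder task body: sum the integers below n."""
    return sum(range(n))


with ThreadPoolExecutor(max_workers=2) as pool:
    mytask = pool.submit(task_code, 1000)   # like "#pragma omp task(mytask)"
    other = pool.submit(task_code, 10)      # ... other tasks ...
    total = mytask.result()                 # like "#pragma omp taskwait(mytask)"
print(total)
```

Waiting on `mytask` does not wait for `other`, which is the property that lets task objects express dependency graphs that are not strictly nested.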

Mutual exclusion: Unlike ordered, which specifies 
a particular sequential ordering, mutual exclusion facil- 
ities such as critical sections and locks imply an 
arbitrary, nondeterministic ordering. Mutual exclusion 
violates Constraint 1 in Section 2.1 because it permits
multiple acquire/release pairings, as illustrated in
Figure 3(d). While DOMP could emulate mutual exclusion
via deterministic scheduling, we prefer to focus on de- 
veloping deterministic abstractions to replace common 
uses of mutual exclusion, such as general reductions. 

Flush: Some OpenMP programs implement custom 
synchronization structures such as pipelines using the 
flush (memory barrier) construct in spin loops. As with
mutual exclusion, DOMP omits support for such constructions
in favor of expressing dependency graphs
such as pipelines deterministically using task objects.

4 Implementing DC 

We have built an early user-space prototype implementing
DC with a pthreads-like fork/join API. The prototype
encouragingly shows less than 10% overhead on
the coarse-grained PARSEC benchmarks [5] Blackscholes
and Swaptions. Finer-grained benchmarks such as
Streamcluster currently show high overheads, but many 
optimization opportunities remain. The rest of this sec- 
tion outlines key challenges and opportunities in imple- 
menting deterministic consistency, for both shared mem- 
ory multithreaded programs and multiprocess systems. 

4.1 Shared Memory Challenges 

Memory Access Isolation: Since DC requires one 
thread's writes to remain invisible to a second thread 
until the two threads synchronize, the threads must ef- 
fectively execute in separate "workspaces" between syn- 
chronization events. Virtual memory and write-sharing 
techniques like those used to implement lazy release 
consistent distributed shared memory [1] should apply
to DC. Memory accesses may also be isolated via
instruction-level rewriting [3], possibly reducing the cost 
of synchronization operations at the expense of adding 
overhead to all ordinary memory accesses. Hardware 
support [12, 17] could mitigate the performance cost of
isolation, but is unlikely to appear in commodity hard- 
ware unless software-based approaches first demonstrate 
deterministic parallelism to be viable and compelling. 

Shared Resources: Shared resources in current environments
implicitly introduce nondeterminism through
mutual exclusion: calling malloc() concurrently in
multiple threads may yield different pointers depending
on execution timing, for example, and the file descriptor
number returned by a call to Unix's open() may
have similar timing dependencies on other threads' file
descriptor operations. The malloc() problem may be
addressed by assigning each thread a separate virtual
memory address range and allocation pool from which
to satisfy malloc() requests; such an allocator may
also benefit scalability. The file descriptor table problem
might be addressed by using higher-level equivalents
such as fopen() that do not imply mutual exclusion.
These approaches do not address shared resources
outside the application process, however, such as reads
and writes to shared files in an external file system.
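
The per-thread allocation pool can be sketched as a simple bump allocator over disjoint address ranges. Everything here is an assumption made for illustration: the names `ThreadArena` and `arena_for`, the heap base address, and the 1 MiB arena size are hypothetical, and addresses are modeled as plain integers rather than real memory.

```python
class ThreadArena:
    """Bump allocator over an address range reserved for one thread."""

    def __init__(self, base, size):
        self.base = base
        self.limit = base + size
        self.next = base

    def malloc(self, nbytes, align=16):
        addr = (self.next + align - 1) // align * align   # round up
        if addr + nbytes > self.limit:
            raise MemoryError("arena exhausted")
        self.next = addr + nbytes
        return addr


ARENA_SIZE = 1 << 20          # 1 MiB per thread (arbitrary for the sketch)
HEAP_BASE = 0x10000000        # hypothetical start of the partitioned heap


def arena_for(tid):
    """Give thread `tid` its own disjoint slice of the address space."""
    return ThreadArena(HEAP_BASE + tid * ARENA_SIZE, ARENA_SIZE)


a0, a1 = arena_for(0), arena_for(1)
p = a0.malloc(24)
q = a1.malloc(24)             # unaffected by any allocation in thread 0
r = a0.malloc(8)
print(hex(p), hex(q), hex(r))
```

Each returned address depends only on the thread's identifier and its own allocation history, so pointer values no longer vary with the interleaving of threads, and no lock is needed on the allocation path.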
4.2 Beyond Shared Memory

While we have focused on the intra-process shared 
memory abstraction, DC may also be applicable at the 
system level for state shared among processes. Standard 
operating systems, for example, commonly give all pro- 
cesses sequentially consistent access to a globally shared 
file system (though network file systems often relax
consistency somewhat). This design yields the same problems
of nondeterminism and heisenbugs at the inter-process
level that we see within multithreaded programs: we often
find that a large software source tree builds reliably under
a sequential 'make' but fails nondeterministically
under a parallel 'make -j' command, for example.

In place of sequential consistency, an OS might pro- 
vide a deterministically consistent file system to pro- 
cesses, enabling a multi-process computation to run de- 
terministically even as processes share state by reading 
and writing files. If a parallel make forks off two com- 
piler instances running in parallel, for example, each 
compiler would execute in its own private virtual copy 
of the file system until completion; the system would 
then reconcile the .o files produced by each compiler
into a single directory once both compilers complete. 

There will always be shared resources "outside the 
reach" of any deterministic environment, whose use will 
introduce nondeterminism into the program: for exam- 
ple, I/O requests arriving at a network server from its 
clients. In such cases the only solution may be to ac- 
cept some nondeterminism, log nondeterministic inputs 
to enable later replay, or avoid their use entirely. 

5 Related Work 

DC conceptually builds on release consistency [17] and
lazy release consistency [24], which relax sequential
consistency's ordering constraints to increase the inde- 
pendence of parallel activities. DC retains these inde- 
pendence benefits, additionally providing determinism 
by delaying the propagation of any thread's writes to 
other threads until required by explicit synchronization. 

Race detectors [2, 30] can detect certain heisenbugs,
but only determinism eliminates their possibility.
Language extensions can dynamically check determinism
assertions in parallel code [10, 31], but heisenbugs
may persist if the programmer omits an assertion.
SHIM [?, 13, 33] provides a deterministic message-passing
programming model, and DPJ [?] enforces determinism
in a parallel shared memory environment via
type system constraints. While we find language-based 
solutions promising, parallelizing the huge body of ex- 
isting sequential code will require parallel programming 
models compatible with existing languages. 

DMP [12] uses binary rewriting to execute existing
parallel code deterministically, dividing threads' execution
into fixed "quanta" and synthesizing an artificial
round-robin execution schedule. Since DMP is effectively
a deterministic implementation of a nondeterministic
programming model, slight input changes may
still reveal schedule-dependent bugs. Grace [4] runs
fork/join-style programs deterministically using virtual
memory techniques. These systems still pursue sequential
consistency as an "ideal" and rely on speculation
for parallelism: if a thread reads a variable concurrently
written by another, as in the "swap" example in Section 1,
one thread aborts and re-executes sequentially. A
partial exception is DMP-B [3], which weakens consistency
within a parallel execution quantum. DC, in contrast,
keeps threads fully independent between program-defined
synchronization points, never requires speculation
or rollback, and imposes no artificial execution
schedules prone to accidental perturbation.

Replay systems can log and reproduce particular ex- 
ecutions of conventional nondeterministic programs, for 
debugging [11, 26] or intrusion analysis [13, 22]. The
performance and space costs of logging nondeterminis- 
tic events usually make replay usable only "in the lab," 
however: if a bug or intrusion manifests under deploy- 
ment with logging disabled, the event may not be sub- 
sequently reproducible. In a deterministic environment, 
any event is reproducible provided only that the original 
external inputs to the computation are logged. 

As with deterministic consistency, transactional
memory (TM) systems [19, 32] isolate a thread's
memory accesses from visibility to other threads except 
at well-defined synchronization points, namely between 
transaction start and commit/abort events. TM offers no 
deterministic ordering between transactions, however: 
like mutex-based synchronization, transactions guaran- 
tee only atomicity, not determinism. 

6 Conclusion 

Building reliable software on massively multicore pro- 
cessors demands a predictable, understandable program- 
ming model, a goal that may require giving up sequential 
consistency and mutual exclusion. Deterministic con- 
sistency provides an alternative parallel programming 
model as simple as "parallel assignment," and supports 
existing languages and synchronization abstractions. 